Natural Language Processing Project
Fine-tuning Mistral 7B with QLoRA for Knowledge Graph (KG) construction.
This project was developed for the Natural Language Processing course of the Master's Degree in Data Science.
Building Knowledge Graphs from unstructured text has traditionally required complex pipelines of multiple specialized models, a significant barrier for many developers and organizations. I set out to create a more accessible solution by fine-tuning a Large Language Model (LLM) for this specific task.
I developed a novel approach to Automatic Knowledge Graph Construction (AKGC) using Mistral 7B, a state-of-the-art open-source LLM. The key innovation was allowing users to specify their desired Knowledge Graph schema through prompts, giving them control over what information to extract and how to structure it. To make this possible at scale, I implemented QLoRA for efficient fine-tuning on a single GPU.
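To illustrate the idea of schema-guided prompting, here is a minimal sketch of how a user-specified schema could be injected into an extraction prompt. The function name, schema layout, and prompt wording are illustrative assumptions, not the project's actual format.

```python
# Sketch of schema-guided prompting for KG construction.
# build_extraction_prompt and the schema layout are hypothetical.

def build_extraction_prompt(text: str, schema: dict) -> str:
    """Build a prompt asking the model to extract triples that
    follow a user-specified Knowledge Graph schema."""
    node_types = ", ".join(schema["nodes"])
    rel_types = ", ".join(
        f"({s})-[{r}]->({o})" for s, r, o in schema["relationships"]
    )
    return (
        "Extract a knowledge graph from the text below.\n"
        f"Allowed node types: {node_types}\n"
        f"Allowed relationships: {rel_types}\n"
        "Return one triple per line as (subject, RELATION, object).\n\n"
        f"Text: {text}"
    )

schema = {
    "nodes": ["Person", "Company"],
    "relationships": [("Person", "WORKS_AT", "Company")],
}
prompt = build_extraction_prompt("Alice works at Acme.", schema)
print(prompt)
```

Because the schema lives in the prompt rather than in the model weights, users can change what gets extracted without retraining.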
The project involved creating a comprehensive synthetic dataset from six Neo4j graph databases, carefully selected for their quality and real-world applications. This data was used to train and evaluate different approaches to AKGC. The results were compelling: the fine-tuned model achieved 81% precision and 77% recall on known graph schemas, significantly outperforming the base model. More importantly, it showed superior zero-shot performance on entirely new schemas, demonstrating genuine generalization of the task.
Through rigorous experimentation, I uncovered crucial insights about AKGC with language models. While zero-shot approaches excel for single-document extraction, maintaining consistency across multiple sources requires few-shot examples to standardize entity formatting. This led to developing an improved version optimized for few-shot learning, which outperformed both the base and initial fine-tuned models.
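The few-shot idea can be sketched as follows: a handful of worked examples in the prompt anchor the entity naming convention, so the same entity is formatted identically across documents. The example pairs and function below are illustrative, not the project's actual prompts.

```python
# Sketch of few-shot prompting to standardize entity formatting.
# The demonstration pairs are hypothetical.

FEW_SHOT_EXAMPLES = [
    (
        "Dr. A. Turing joined the NPL in 1945.",
        '("Alan Turing", "WORKS_AT", "National Physical Laboratory")',
    ),
    (
        "Turing later moved to Manchester.",
        '("Alan Turing", "LOCATED_IN", "Manchester")',
    ),
]

def build_few_shot_prompt(text: str) -> str:
    """Prepend worked examples so the model reuses the same
    canonical entity names across different source documents."""
    parts = ["Extract triples. Use full canonical entity names.\n"]
    for source, triple in FEW_SHOT_EXAMPLES:
        parts.append(f"Text: {source}\nTriples: {triple}\n")
    parts.append(f"Text: {text}\nTriples:")
    return "\n".join(parts)

print(build_few_shot_prompt("A. Turing published a famous paper in 1950."))
```

Without such anchors, zero-shot extraction may emit "A. Turing" in one document and "Alan Turing" in another, fragmenting the merged graph.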
One particularly interesting finding was that early stopping during fine-tuning (around 50 steps) produced better generalization to new schemas. This suggests that longer training causes the model to overfit to specific database structures rather than learning generalizable AKGC skills.
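A typical 4-bit QLoRA setup with the Hugging Face stack can cap training at that step count via `max_steps`. This is a config sketch with hypothetical hyperparameters (rank, learning rate, target modules), not the project's exact settings:

```python
# Config sketch: 4-bit QLoRA fine-tuning capped at ~50 steps,
# per the early-stopping finding. Hyperparameters are illustrative.
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,              # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",      # NormalFloat4 quantization
    bnb_4bit_compute_dtype="bfloat16",
)

lora_config = LoraConfig(
    r=16,                           # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="mistral-akgc",
    max_steps=50,                   # early stop: longer runs overfit to seen schemas
    learning_rate=2e-4,
    per_device_train_batch_size=4,
)
```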
The final implementation provides an accessible alternative to traditional AKGC pipelines, making structured knowledge extraction more approachable for users with limited resources. Beyond the technical implementation, the project contributed valuable insights to the field and opened new research directions, particularly around creating more diverse training datasets and optimizing fine-tuning parameters.
This project demonstrates my ability to tackle complex NLP challenges, implement cutting-edge machine learning techniques, and develop practical solutions that make advanced technology more accessible to the wider development community.